The skeleton of a Bayesian network learned with the MMHC algorithm.
mmhc.skel(x, method = "pearson", max_k = 3, alpha = 0.05,
robust = FALSE, ini.stat = NULL, R = NULL, parallel = FALSE)
A list including:
The test statistics of the univariate associations.
The initial p-values of the univariate associations.
A matrix with the logarithm of the p-values of the updated associations. For each pair of variables the final p-value is the maximum of the two p-values computed for that pair.
The duration of the algorithm.
The number of tests conducted at each value of k.
The adjacency matrix. A value of 1 in G[i, j] also appears in G[j, i], indicating that variables i and j are connected by an edge.
A numerical matrix with the variables. If you have a data.frame (i.e. categorical data), turn it into a matrix. Note that for categorical data, the numbers must start from 0. No missing data are allowed.
If you have continuous data, this is "pearson". If you have categorical data, this must be "cat". In that case, make sure the minimum value of each variable is zero; the function "g2Test()" in the R package Rfast and the related functions work that way.
The maximum size of the conditioning set to use in the conditional independence test (see Details). Integer, default value is 3.
The significance level (suitable values in (0, 1)) for assessing the p-values. Default value is 0.05.
Do you want outliers to be removed prior to applying the MMHC algorithm? If yes, set this to TRUE to utilise the MCD.
If the initial test statistics (univariate associations) are available, pass them through this parameter.
If the correlation matrix is available, pass it here.
Set this to TRUE if you have millions of observations; in that case it can reduce the computational time by about a third.
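For categorical data, the input must be a numeric matrix whose values start from 0, as required by method = "cat". A minimal sketch, assuming the pchc package (which provides mmhc.skel) is installed:

```r
library(pchc)  # assumed package providing mmhc.skel

# simulate 10 categorical variables with values in {0, 1, 2}
z <- matrix( rbinom(200 * 10, 2, 0.5), nrow = 200 )
a <- mmhc.skel(z, method = "cat", max_k = 3, alpha = 0.05)
```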
Michail Tsagris.
R implementation and documentation: Michail Tsagris mtsagris@uoc.gr.
The max_k option: the maximum size of the conditioning set to use in the conditional independence test. Larger values provide more accurate results, at the cost of higher computational times. When the sample size is small (e.g., \(<50\) observations), the max_k parameter should be small (e.g., 3), otherwise the conditional independence test may not be able to provide reliable results.
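With few observations, restricting the conditioning set keeps the tests reliable. A sketch under the assumption that the pchc package provides mmhc.skel:

```r
library(pchc)  # assumed package providing mmhc.skel

# only 40 observations: keep the conditioning set small
x <- matrix( rnorm(40 * 10), nrow = 40 )
a <- mmhc.skel(x, max_k = 2)
```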
Tsamardinos, I., Aliferis, C. F. and Statnikov, A. (2003). Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 673--678). ACM.
Brown, L. E., Tsamardinos, I. and Aliferis, C. F. (2004). A novel algorithm for scalable and accurate Bayesian network learning. Medinfo, 711--715.
Tsamardinos, I., Brown, L. E. and Aliferis, C. F. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1): 31--78.
pchc.skel, mmhc.skel, mmhc, mmhc.skel.boot
# simulate a dataset with continuous data
x <- matrix( rnorm(300 * 30, 1, 100), nrow = 300 )
# learn the skeleton of the Bayesian network
a <- mmhc.skel(x)